Handling Categorical Data with Pandas
This notebook covers essential techniques for working with categorical data in Pandas, including:
- Encoding Methods: Converting categorical variables to numerical formats
- Grouping Operations: Analyzing category distributions and aggregations
- Data Transformation: Reshaping data with melt and pivot operations
Categorical data transformation is crucial for machine learning models that require numerical inputs.
1. Setting Up Sample Data
Let’s start by creating a sample DataFrame with categorical data to work with.

import pandas as pd
data = {
    'Category': ['A', 'B', 'C', 'C', 'B', 'A']
}
df = pd.DataFrame(data)
df

| | Category |
|---|---|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | C |
| 4 | B |
| 5 | A |
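
The sample column is stored as generic Python strings. As a minimal sketch on the same six rows, the snippet below converts it to pandas' dedicated `category` dtype, which keeps each distinct label once and represents every row by a small integer code. The variable names `cat_col`, `categories`, and `codes` are illustrative, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'A']})

# Convert the object column to pandas' categorical dtype
cat_col = df['Category'].astype('category')

# The distinct labels are stored once...
categories = list(cat_col.cat.categories)
# ...and each row holds an integer code pointing at its label
codes = cat_col.cat.codes.tolist()

print(categories)
print(codes)
```

This dtype is mainly a storage and bookkeeping convenience; the encoding techniques in the next section still apply to it.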
2. Encoding Categorical Data
Machine learning algorithms typically require numerical inputs. Categorical encoding converts text categories into numbers. Here are the most common techniques:
One-Hot Encoding
One-hot encoding creates binary columns for each category. It’s ideal for nominal (unordered) categories.
Pros: No ordinal assumptions, works well with most algorithms
Cons: Can create many columns when a variable has many distinct categories (curse of dimensionality)

# Keep only the A and B indicator columns; rows where both are False belong to C
pd.get_dummies(df['Category'])[['A','B']]

| | A | B |
|---|---|---|
| 0 | True | False |
| 1 | False | True |
| 2 | False | False |
| 3 | False | False |
| 4 | False | True |
| 5 | True | False |
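
The cell above selects columns A and B by hand. As a hedged alternative sketch, `pd.get_dummies` can also produce 0/1 integers via `dtype`, label the columns via `prefix`, and drop one level via `drop_first`. Note that `drop_first=True` drops the first level (A here), not C, so the baseline category differs from the cell above. The names `full` and `reduced` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'A']})

# Full encoding: one 0/1 column per category, with a readable prefix
full = pd.get_dummies(df['Category'], prefix='Category', dtype=int)

# Reduced encoding: drop the first level (A) so it becomes the implicit baseline
reduced = pd.get_dummies(df['Category'], prefix='Category', dtype=int, drop_first=True)

print(full)
print(reduced)
```

Dropping one level avoids perfectly collinear columns, which matters mostly for linear models; tree-based models are usually fine with the full encoding.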
Label Encoding
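Label encoding assigns an integer to each category. It suits categories with a natural order (ordinal data), provided the integers follow that order. scikit-learn's LabelEncoder numbers classes in sorted order (alphabetical for strings) and is documented as a tool for encoding target labels, so check that the resulting codes match the order you intend.
Pros: Memory efficient, preserves single column
Cons: Implies ordinal relationship even when none exists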

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Category_LabelEncoded'] = label_encoder.fit_transform(df['Category'])
df

| | Category | Category_LabelEncoded |
|---|---|---|
| 0 | A | 0 |
| 1 | B | 1 |
| 2 | C | 2 |
| 3 | C | 2 |
| 4 | B | 1 |
| 5 | A | 0 |
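
Because LabelEncoder numbers classes in sorted order, the codes for truly ordinal data can come out in the wrong order. The sketch below shows one way to state the order explicitly with an ordered pandas Categorical. The `Size` column and its Low/Medium/High levels are hypothetical data invented for this illustration.

```python
import pandas as pd

# Hypothetical ordinal data: sizes with a meaningful order
sizes = pd.DataFrame({'Size': ['Medium', 'Low', 'High', 'Low', 'High']})

# Declare the order explicitly instead of relying on alphabetical sorting
order = ['Low', 'Medium', 'High']
sizes['Size'] = pd.Categorical(sizes['Size'], categories=order, ordered=True)

# Integer codes now follow the declared order: Low=0, Medium=1, High=2
sizes['Size_Code'] = sizes['Size'].cat.codes

print(sizes)
```

Alphabetical sorting would have produced High=0, Low=1, Medium=2, which reverses part of the intended ranking.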
3. Analyzing Categorical Data with Grouping
Grouping operations help you understand the distribution and patterns in your categorical data. This is essential for exploratory data analysis.
Counting Category Frequencies
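Use groupby().size() or groupby().count() to see how many times each category appears. size() counts every row in a group, while count() counts only the non-null values in each column, so the two differ when data is missing.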

df.groupby('Category').size()

Category
A    2
B    2
C    2
dtype: int64
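
The sample data has no missing values, so size() and count() agree on it. As a small sketch of where they diverge, the snippet below adds a hypothetical `Value` column with one missing entry; that column and its numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 'Value' column with one missing entry in group B
df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'C', 'B', 'A'],
    'Value': [10.0, np.nan, 30.0, 40.0, 50.0, 60.0],
})

# size() counts rows per group, missing values included
sizes = df.groupby('Category').size()

# count() counts only the non-null entries of the selected column
counts = df.groupby('Category')['Value'].count()

print(sizes)
print(counts)
```

Group B has two rows but only one recorded value, so the two results differ for B only.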

df.groupby('Category').agg({'Category':'count'})

| | Category |
|---|---|
| Category | |
| A | 2 |
| B | 2 |
| C | 2 |
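
Counting is only one kind of aggregation. As a sketch, assuming a hypothetical numeric `Value` column alongside `Category`, named aggregation can compute several per-category statistics in one call; the output column names `n`, `total`, and `mean` are arbitrary choices for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'C', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50, 90],  # hypothetical numeric column
})

# Named aggregation: output_name=(input_column, function)
summary = df.groupby('Category').agg(
    n=('Value', 'count'),
    total=('Value', 'sum'),
    mean=('Value', 'mean'),
)

# Share of rows per category, as an alternative to raw counts
shares = df['Category'].value_counts(normalize=True)

print(summary)
print(shares)
```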
4. Data Transformation: Reshaping with Melt and Pivot
Data reshaping is crucial for transforming your data between “wide” and “long” formats. This is particularly useful when working with categorical data across multiple variables.
Wide to Long Format (melt)
pd.melt() unpivots a DataFrame from wide format to long format. This is useful for:
- Converting multiple categorical columns into a single column
- Preparing data for visualization libraries
- Making data more database-friendly

# Reshaping Data
data = {
    'Name': ['John', 'Emily', 'Kate'],
    'Math': [90, 85, 88],
    'Science': [92, 80, 95]
}
df = pd.DataFrame(data)
df

| | Name | Math | Science |
|---|---|---|---|
| 0 | John | 90 | 92 |
| 1 | Emily | 85 | 80 |
| 2 | Kate | 88 | 95 |

df_melted = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')
df_melted

| | Name | Subject | Score |
|---|---|---|---|
| 0 | John | Math | 90 |
| 1 | Emily | Math | 85 |
| 2 | Kate | Math | 88 |
| 3 | John | Science | 92 |
| 4 | Emily | Science | 80 |
| 5 | Kate | Science | 95 |
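
By default, melt unpivots every column not listed in `id_vars`. When a frame carries more than one identifier, it can help to name both the identifiers and the columns to unpivot. The sketch below assumes a hypothetical extra `Class` column that should stay attached to each student.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emily', 'Kate'],
    'Class': ['X', 'Y', 'X'],  # hypothetical extra identifier column
    'Math': [90, 85, 88],
    'Science': [92, 80, 95],
})

# Keep Name and Class as identifiers; unpivot only the listed subject columns
long_df = pd.melt(
    df,
    id_vars=['Name', 'Class'],
    value_vars=['Math', 'Science'],
    var_name='Subject',
    value_name='Score',
)

print(long_df)
```

Without `value_vars`, melt would still work here, but listing the columns makes the intent explicit and guards against new columns being unpivoted by accident.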
Long to Wide Format (pivot)
df.pivot() does the opposite of melt: it converts long format back to wide format. This is useful for:
- Creating summary tables
- Preparing data for certain types of analysis
- Making data more human-readable

df_melted.pivot(index='Name', columns='Subject', values='Score')

| Subject | Math | Science |
|---|---|---|
| Name | | |
| Emily | 85 | 80 |
| John | 90 | 92 |
| Kate | 88 | 95 |
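
pivot() works above because each (Name, Subject) pair appears exactly once. As a sketch of what happens otherwise, the snippet below uses hypothetical long data with a repeated pair: pivot() raises a ValueError, while pivot_table() resolves the duplicates with an aggregation function (the mean is an arbitrary choice here).

```python
import pandas as pd

# Hypothetical long data where John has two Math scores
long_df = pd.DataFrame({
    'Name': ['John', 'John', 'Emily', 'John', 'Emily'],
    'Subject': ['Math', 'Math', 'Math', 'Science', 'Science'],
    'Score': [90, 70, 85, 92, 80],
})

# pivot() refuses duplicate (index, columns) pairs
try:
    long_df.pivot(index='Name', columns='Subject', values='Score')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates duplicates instead of failing
wide = long_df.pivot_table(index='Name', columns='Subject', values='Score', aggfunc='mean')

# reset_index() turns the 'Name' index back into an ordinary column
flat = wide.reset_index()
flat.columns.name = None

print(pivot_failed)
print(flat)
```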
Summary
In this notebook, you learned essential data transformation techniques for categorical data:
- Encoding: Convert text categories to numbers
  - One-hot encoding for nominal data
  - Label encoding for ordinal data
- Grouping: Analyze category distributions
  - Count frequencies with groupby().size()
  - Aggregate data by categories
- Reshaping: Transform data structure
  - melt(): Wide to long format
  - pivot(): Long to wide format
These techniques form the foundation of data preprocessing for machine learning and analysis workflows. Choose the right method based on your data characteristics and modeling requirements!
Next Steps: Practice with real datasets and explore advanced encoding techniques like target encoding or frequency encoding.
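
As a starting point for the frequency encoding mentioned above, here is a minimal sketch that replaces each category with the number of times it occurs. The seven-row sample and the `Category_Freq` column name are assumptions made for this illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample with unequal category frequencies
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'A', 'A']})

# Count how often each category occurs
freq = df['Category'].value_counts()

# Map every row's category to its count
df['Category_Freq'] = df['Category'].map(freq)

print(df)
```

Categories that occur equally often receive the same code (B and C here), so this encoding can merge distinct categories; whether that is acceptable depends on the model and the data.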